
Adding the StringEncoder transformer #1159

Merged: 63 commits, Jan 27, 2025
Conversation

rcap107
Contributor

@rcap107 rcap107 commented Nov 26, 2024

This is a first draft of a PR to address #1121

I looked at GapEncoder to figure out what to do. This is a very early version just to have an idea of the kind of code that's needed.

Things left to do:

  • Testing
  • Parameter checking?
  • Default value for the PCA?
  • Docstrings
  • Deciding name of the features

@rcap107
Contributor Author

rcap107 commented Dec 5, 2024

Tests fail on the minimum-requirements build because I am using PCA rather than TruncatedSVD for the decomposition, which raises issues with potentially sparse matrices.

@jeromedockes suggests using TruncatedSVD directly from the start, rather than adding a check on the scikit-learn version.

Also, I am using tf-idf as the vectorizer; should I use something else? Maybe HashingVectorizer?

(writing this down so I don't forget)
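For reference, the setup discussed here can be sketched as below; the analyzer and n-gram range are illustrative choices, not necessarily the ones used in the PR.

```python
# Sketch of the discussed approach: character n-gram tf-idf followed by
# TruncatedSVD, which (unlike PCA) works directly on sparse matrices.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

strings = ["London", "Londres", "Paris", "Paris 15e", "Berlin"]

# tf-idf over character n-grams produces a sparse matrix
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4))
X_sparse = vectorizer.fit_transform(strings)

# TruncatedSVD accepts the sparse input directly, so no conversion
# to a dense array (and no scikit-learn version check) is needed
svd = TruncatedSVD(n_components=2)
embeddings = svd.fit_transform(X_sparse)
print(embeddings.shape)  # (5, 2)
```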

@GaelVaroquaux
Member

I'm very happy to see this progressing.

Can you benchmark it on the experiments from Leo's paper? This is important for the modeling choices (e.g. the hyper-parameters).

@rcap107
Contributor Author

rcap107 commented Dec 9, 2024

> I'm very happy to see this progressing.
>
> Can you benchmark it on the experiments from Leo's paper? This is important for the modeling choices (e.g. the hyper-parameters).

Where can I find the benchmarks?

@GaelVaroquaux
Member

Actually, let's keep it simple and use the CARTE datasets; they are good enough: https://huggingface.co/datasets/inria-soda/carte-benchmark

You probably want to instantiate a pipeline that uses TableVectorizer + HistGradientBoosting, but embeds one of the string columns with the StringEncoder (the one with either the highest cardinality, or the most "diverse" entries in the sense of https://arxiv.org/abs/2312.09634).

@Vincent-Maladiere
Member

Should we also add this to the text encoder example, alongside the TextEncoder, MinHashEncoder and GapEncoder? It shows a small benchmark on the toxicity dataset.

@rcap107
Contributor Author

rcap107 commented Dec 9, 2024

> Should we also add this to the text encoder example, alongside the TextEncoder, MinHashEncoder and GapEncoder? It shows a small benchmark on the toxicity dataset.

It's already there, and it shows that StringEncoder has performance similar to that of GapEncoder and runtime similar to that of MinHashEncoder.

[image: benchmark plot on the toxicity dataset]

@Vincent-Maladiere
Member

That's very interesting!

@rcap107
Contributor Author

rcap107 commented Dec 17, 2024

> we'll have to be very careful to summarize the tradeoffs between the different encoders in a few lines (something short and clear :D) at the top of the corresponding section of the docs. It is very important that we convey to the user what we have learned. This is something for a separate PR though.

I'd rather not. IMHO the docs need to be reorganized as we add complexity to the package. Also, the evidence for this recommendation comes from this PR.

I updated the doc page on the encoders, but only to add the StringEncoder and a short summary of the different methods. Looking at the page, I think it would be better to expand it with more detail on all encoders, and maybe an explanation of the parameters, but that would take much more effort (and is definitely something for a separate PR).

@rcap107
Contributor Author

rcap107 commented Dec 17, 2024

> Nice! So what is the conclusion regarding StringEncoder(1, 1)? How can it perform so well against drop and OrdinalEncoder, when it only considers individual characters?

[image: encoder comparison plot]

My feeling is that OrdinalEncoder is just not that good when there is no inherent order in the feature to begin with, while strings that are similar to each other are usually related, no matter how they are sliced.

I think an interesting experiment would be a dictionary replacement where every string in the starting table is replaced by a random alphanumeric string, and then checking the performance of the encoders on that. In that case, I imagine StringEncoder would not do so well compared to OrdinalEncoder.
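That experiment could be sketched as follows (the function name and token length are made up for illustration): each unique value is mapped to a fixed random token, preserving category identity while destroying substring similarity.

```python
import random
import string

def scramble_categories(values, length=8, seed=0):
    """Map each distinct value to a fixed random alphanumeric token."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits
    mapping = {
        v: "".join(rng.choice(alphabet) for _ in range(length))
        for v in dict.fromkeys(values)  # unique values, order preserved
    }
    return [mapping[v] for v in values]

cities = ["Paris", "Paris 15e", "London", "Paris"]
scrambled = scramble_categories(cities)
# identical inputs still map to identical tokens, but "Paris" and
# "Paris 15e" no longer share substrings by construction
assert scrambled[0] == scrambled[3]
```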

@rcap107 rcap107 self-assigned this Dec 22, 2024
Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Hey @rcap107! Here are a bunch of questions and nitpicks :)

Comment on lines +140 to +155
if (min_shape := min(X_out.shape)) >= self.n_components:
self.tsvd_ = TruncatedSVD(n_components=self.n_components)
result = self.tsvd_.fit_transform(X_out)
else:
warnings.warn(
f"The matrix shape is {X_out.shape}, and its minimum is "
f"{min_shape}, which is too small to fit a truncated SVD with "
f"n_components={self.n_components}. "
"The embeddings will be truncated by keeping the first "
f"{self.n_components} dimensions instead. "
)
# self.n_components can be greater than the number
# of dimensions of result.
# Therefore, self.n_components_ below stores the resulting
# number of dimensions of result.
result = X_out[:, : self.n_components].toarray()
Member

Maybe L140 to L155 could be brought into a common utility shared with the text encoder, WDYT?

Contributor Author

Sure, not in this PR though.
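For what it's worth, such a shared helper might look like this; the function name is hypothetical, and the logic follows the snippet quoted above.

```python
import warnings

from scipy import sparse
from sklearn.decomposition import TruncatedSVD

def reduce_dimensions(X, n_components):
    """Reduce X with TruncatedSVD, or truncate columns if X is too small.

    Returns (result, fitted_svd), where fitted_svd is None when plain
    column truncation was used instead of an SVD.
    """
    if min(X.shape) >= n_components:
        tsvd = TruncatedSVD(n_components=n_components)
        return tsvd.fit_transform(X), tsvd
    warnings.warn(
        f"The matrix shape is {X.shape}, which is too small to fit a "
        f"truncated SVD with n_components={n_components}. The embeddings "
        "will be truncated by keeping the first dimensions instead."
    )
    result = X[:, :n_components]
    return result.toarray() if sparse.issparse(result) else result, None
```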

Member

@Vincent-Maladiere Vincent-Maladiere left a comment

OpenML downloads still fail and break the CI

Member

@jeromedockes jeromedockes left a comment

LGTM!! Thanks a lot @rcap107 this is a great addition 🚀

Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Looks good! Thanks @rcap107 :)

@jeromedockes jeromedockes merged commit d905cd1 into skrub-data:main Jan 27, 2025
25 checks passed
@GaelVaroquaux
Member

GaelVaroquaux commented Jan 27, 2025

I'm doing a last review of this PR. It's great, I love it. A tiny comment: I think I'd prefer the comparison plot on a log scale; if people agree, I'll implement the change (I'll do something better than what's shown below).
[image: comparison plot with log-scale axis]
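Purely illustrative (made-up points, not the PR's benchmark data): switching the time axis of a matplotlib scatter plot to a log scale amounts to the following.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, for illustration only
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# made-up (fit time, score) points standing in for the encoders
ax.scatter([0.1, 1.0, 10.0], [0.60, 0.70, 0.72])
ax.set_xscale("log")  # log scale spreads out the fast encoders
ax.set_xlabel("fit time (s)")
ax.set_ylabel("prediction score")
fig.savefig("comparison_log.png")
```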

@jeromedockes
Member

fine by me!

@Vincent-Maladiere
Member

+1
